In data science,sometimes in real world datasets, there can be values missing due to many reasons(). An easy solution for it is just remove those datapoint from the dataset. But if the missing values are significant , then we have to fill those missing values before applying any machine learning model .
Ignoring missing values can have a negative effect on the accuracy of the model. So its good practice to fill/impute those missing values as much as possible.
Missing values can be both categorical or numerical.They are different methods to handle missing values .For our example, we focus on handling numerical type values.
To demonstrate filling up missing values in numerical data, we will look at an example using the pima.csv dataset. The dataset contains data about the Pima people which predicts their posibility of having diabetes based on the features Pregnancies, Glucose, BloodPressure, SkinThickness, Insulin, BMI, DiabetesPedigreeFunction and Age.
The pima dataset has lots of missing values. In this example, we will find the missing values in the dataset and fill the missing values based on their distribution.
Our first task is to load the pandas library and the pima.csv dataset into a dataframe.
The steps are:
pandas library pima using read_csv method in pandas pima. import pandas as pd
pima= pd.read_csv("pima_updated_28Jan.csv")
pima.head()
| Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | DiabetesPedigreeFunction | Age | Outcome | |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 6 | 148.0 | 72.0 | 35.0 | NaN | 33.6 | 0.627 | 50 | 1 |
| 1 | 1 | 85.0 | 66.0 | 29.0 | NaN | 26.6 | 0.351 | 31 | 0 |
| 2 | 8 | 183.0 | 64.0 | NaN | NaN | 23.3 | 0.672 | 32 | 1 |
| 3 | 1 | 89.0 | 66.0 | 23.0 | 94.0 | 28.1 | 0.167 | 21 | 0 |
| 4 | 0 | 137.0 | 40.0 | 35.0 | 168.0 | 43.1 | 2.288 | 33 | 1 |
Our next task is to check where the null values exists in the dataset. We check all the columns to find the null values in the dataset. For this task we use the isnull() method.
The steps are:
isnull() method to get all the null values in pima variable.sum() method to get the null value count in each column.pima.isnull().sum()
Pregnancies 0 Glucose 5 BloodPressure 35 SkinThickness 227 Insulin 374 BMI 11 DiabetesPedigreeFunction 0 Age 0 Outcome 0 dtype: int64
From the output shown, If we look closely at the null values in each column, columns SkinThickness and Insulin have a lot of null values.
For this reason , in this task, we focus on filling the missing/NULL values in the Insulin and SkinThickness columns.
To fill missing values in numerical data, aggregation methods such as mean , median are used in general cases.
For Skin_Thickness, to fill the missing values, our first task is to see the distribution of its data.
For this, we can use a histogram to observe its distribution. We will use the plotly library to use the histogram.
The steps are:
plotly libraryhistogram() method in variable fig to draw the histogrampima, where the data is storedSkinThickness, whose histogram will be shownnbinsshow methodimport plotly.express as px
fig= px.histogram(pima, x='SkinThickness', nbins=50)
fig.show()
From the histogram above, we can see the data distribution of data is a normal distribution. In this type of scenarios, to impute the missing values in data, we usually use the mean of the data(Skin_Thickness) column since the values are centered not skewed.
Similarly, for Insulin value, we look at the distribution of its data.
The steps are:
histogram method in variable fig2 to draw the histogrampima, where the data is storedInsulin, whose histogram will be shownnbinsshow methodfig2= px.histogram(pima, x='Insulin', nbins=50)
fig2.show()
From the histogram, we can see that the distibution of Insulin data is right skewed meaning most of the datapoints are on the left side. In this type of scenario, its better to fill the missing values using the median of the data(Insulin) since most of the value will also be on the left side of the Insulin data.
Based on the distribution of data, we fill their missing values with mean and median of their existing values.
SkinThickness missing values:¶The steps are:
fillna method to fill the NULL valuesSkinThickness column data, using the mean methodinplace and set value as True to make the changes permanentSkinThickness values using the head methodpima['SkinThickness'].fillna(pima['SkinThickness'].mean(), inplace=True)
pima.head(10)
| Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | DiabetesPedigreeFunction | Age | Outcome | |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 6 | 148.0 | 72.0 | 35.00000 | NaN | 33.6 | 0.627 | 50 | 1 |
| 1 | 1 | 85.0 | 66.0 | 29.00000 | NaN | 26.6 | 0.351 | 31 | 0 |
| 2 | 8 | 183.0 | 64.0 | 29.15342 | NaN | 23.3 | 0.672 | 32 | 1 |
| 3 | 1 | 89.0 | 66.0 | 23.00000 | 94.0 | 28.1 | 0.167 | 21 | 0 |
| 4 | 0 | 137.0 | 40.0 | 35.00000 | 168.0 | 43.1 | 2.288 | 33 | 1 |
| 5 | 5 | 116.0 | 74.0 | 29.15342 | NaN | 25.6 | 0.201 | 30 | 0 |
| 6 | 3 | 78.0 | 50.0 | 32.00000 | 88.0 | 31.0 | 0.248 | 26 | 1 |
| 7 | 10 | 115.0 | NaN | 29.15342 | NaN | 35.3 | 0.134 | 29 | 0 |
| 8 | 2 | 197.0 | 70.0 | 45.00000 | 543.0 | 30.5 | 0.158 | 53 | 1 |
| 9 | 8 | 125.0 | 96.0 | 29.15342 | NaN | NaN | 0.232 | 54 | 1 |
| 10 | 4 | 110.0 | 92.0 | 29.15342 | NaN | 37.6 | 0.191 | 30 | 0 |
| 11 | 10 | 168.0 | 74.0 | 29.15342 | NaN | 38.0 | 0.537 | 34 | 1 |
| 12 | 10 | 139.0 | 80.0 | 29.15342 | NaN | 27.1 | 1.441 | 57 | 0 |
| 13 | 1 | 189.0 | 60.0 | 23.00000 | 846.0 | 30.1 | 0.398 | 59 | 1 |
| 14 | 5 | 166.0 | 72.0 | 19.00000 | 175.0 | 25.8 | 0.587 | 51 | 1 |
| 15 | 7 | 100.0 | NaN | 29.15342 | NaN | 30.0 | 0.484 | 32 | 1 |
| 16 | 0 | 118.0 | 84.0 | 47.00000 | 230.0 | 45.8 | 0.551 | 31 | 1 |
| 17 | 7 | 107.0 | 74.0 | 29.15342 | NaN | 29.6 | 0.254 | 31 | 1 |
| 18 | 1 | 103.0 | 30.0 | 38.00000 | 83.0 | 43.3 | 0.183 | 33 | 0 |
| 19 | 1 | 115.0 | 70.0 | 30.00000 | 96.0 | 34.6 | 0.529 | 32 | 1 |
Insulin missing values:¶The steps are:
Insulin column from pima.fillna method to fill the NULL valuesInsulin column data, using the median methodinplace and set value as True to make the changes permanentInsulin values using the head methodpima['Insulin'].fillna(pima['Insulin'].median(), inplace=True)
pima.head(10)
| Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | DiabetesPedigreeFunction | Age | Outcome | |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 6 | 148.0 | 72.0 | 35.00000 | 125.0 | 33.6 | 0.627 | 50 | 1 |
| 1 | 1 | 85.0 | 66.0 | 29.00000 | 125.0 | 26.6 | 0.351 | 31 | 0 |
| 2 | 8 | 183.0 | 64.0 | 29.15342 | 125.0 | 23.3 | 0.672 | 32 | 1 |
| 3 | 1 | 89.0 | 66.0 | 23.00000 | 94.0 | 28.1 | 0.167 | 21 | 0 |
| 4 | 0 | 137.0 | 40.0 | 35.00000 | 168.0 | 43.1 | 2.288 | 33 | 1 |
| 5 | 5 | 116.0 | 74.0 | 29.15342 | 125.0 | 25.6 | 0.201 | 30 | 0 |
| 6 | 3 | 78.0 | 50.0 | 32.00000 | 88.0 | 31.0 | 0.248 | 26 | 1 |
| 7 | 10 | 115.0 | NaN | 29.15342 | 125.0 | 35.3 | 0.134 | 29 | 0 |
| 8 | 2 | 197.0 | 70.0 | 45.00000 | 543.0 | 30.5 | 0.158 | 53 | 1 |
| 9 | 8 | 125.0 | 96.0 | 29.15342 | 125.0 | NaN | 0.232 | 54 | 1 |
| 10 | 4 | 110.0 | 92.0 | 29.15342 | 125.0 | 37.6 | 0.191 | 30 | 0 |
| 11 | 10 | 168.0 | 74.0 | 29.15342 | 125.0 | 38.0 | 0.537 | 34 | 1 |
| 12 | 10 | 139.0 | 80.0 | 29.15342 | 125.0 | 27.1 | 1.441 | 57 | 0 |
| 13 | 1 | 189.0 | 60.0 | 23.00000 | 846.0 | 30.1 | 0.398 | 59 | 1 |
| 14 | 5 | 166.0 | 72.0 | 19.00000 | 175.0 | 25.8 | 0.587 | 51 | 1 |
| 15 | 7 | 100.0 | NaN | 29.15342 | 125.0 | 30.0 | 0.484 | 32 | 1 |
| 16 | 0 | 118.0 | 84.0 | 47.00000 | 230.0 | 45.8 | 0.551 | 31 | 1 |
| 17 | 7 | 107.0 | 74.0 | 29.15342 | 125.0 | 29.6 | 0.254 | 31 | 1 |
| 18 | 1 | 103.0 | 30.0 | 38.00000 | 83.0 | 43.3 | 0.183 | 33 | 0 |
| 19 | 1 | 115.0 | 70.0 | 30.00000 | 96.0 | 34.6 | 0.529 | 32 | 1 |
After filling the missing values, lets check if there are any missing values in the SkinThickness and Insulin columns.
The steps are:
SkinThickness,Insulin column data from the pima variable.isnull() method to get the NULL values in those columns.sum() method to sum the null value counts in those columns.pima[['SkinThickness','Insulin']].isnull().sum()
SkinThickness 0 Insulin 0 dtype: int64
We have filled out missing values of numerical type data based on their distribution.